In [3]:
%%capture
%run "14 - Simple Linear Regression.ipynb"
Suppose we put forth a new hypothesis that makes use of more independent variables: $$minutes = \alpha + \beta_1 friends + \beta_2 work\ hours + \beta_3 phd + \epsilon$$
In multiple regression, we can denote a vector or parameters as $\beta$. Then we use vectors for x and beta:
beta = [alpha, beta_1, ..., beta_k]
x_i = [1, x_i1, ..., x_ik]
beta = [ 0.9,
0.7,
0.8,
0.6]
x = [1, # constant term,
49, # number of friends
4, # work hours per day
0] # does not have a PhD
In Simple Linear Regression, our model was: $$y_i = \alpha + \beta x_i + \epsilon_i$$ In a multiple regression we will use: $$y_i = \alpha + \beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i$$
In [4]:
def predict(x_i, beta):
return dot(x_i, beta)
In [5]:
def error(x_i, y_i, beta):
return y_i - predict(x_i, beta)
def squared_error(x_i, y_i, beta):
return error(x_i, y_i, beta)**2
def squared_error_gradient(x_i, y_i, beta):
return [-2 * x_ij * error(x_i, y_i, beta) for x_ij in x_i]
def estimate_beta(x, y):
beta_initial = [random.random() for x_i in x[0]]
return minimize_stochastic(squared_error, squared_error_gradient, x, y, beta_initial, 0.001)
x = [1, # constant term,
49, # number of friends
4, # work hours per day
0] # does not have a PhD
random.seed(0)
beta = estimate_beta([x], daily_minutes_clean)
beta
Out[5]:
Above does not match the book, could not get code to work as shown.
Suppose we have a sample of n number of points and we want to estimate the median but we don't know the distribution. To estimate in this scenario, we can bootstrap new samples from which to generate our estimate in order to increase confidence. We will choose n new data points with replacement and then computer the medians of those generated samples.
In [6]:
def bootstrap_sample(data):
return [random.choice(data) for _ in data]
def bootstrap_statistic(data, stats_fn, num_samples):
return [stats_fn(bootstrap_sample(data)) for _ in range(num_samples)]
In [11]:
close_to_100 = [99.5 + random.random() for _ in range(101)]
far_from_100 = ([99.5 + random.random()] +
[random.random() for _ in range(50)] +
[200 + random.random() for _ in range(50)])
In [12]:
bootstrap_statistic(close_to_100, median, 100)[:5]
Out[12]:
In [13]:
bootstrap_statistic(far_from_100, median, 100)[:5]
Out[13]:
In [14]:
standard_deviation(close_to_100), standard_deviation(far_from_100)
Out[14]:
In [ ]: